A software-defined board support package (SW-BSP) for stand-alone reconfigurable accelerators

ABSTRACT

A software-defined board support package (SW-BSP) for stand-alone reconfigurable accelerators is provided. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. A stand-alone accelerator protocol (SAP) allows for a hardware accelerator to be plug-and-playable in a stand-alone fashion (without needing a local central processing unit (CPU) host) and interact with a remote computing system agent for application acceleration across any network infrastructure. The SAP further facilitates a hardware-agnostic accelerator orchestration (HALO) software framework for hardware-agnostic programming with high performance portability and scalability in heterogeneous computing systems. The SW-BSP provides an implementation of the SAP on reconfigurable accelerators. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware. This environment facilitates dynamic plugin of an accelerator onto the network fabric, which can be auto-discovered and utilized by applications.

RELATED APPLICATIONS

This application claims the benefit of provisional patent application Ser. No. 62/983,220, filed Feb. 28, 2020, the disclosure of which is hereby incorporated herein by reference in its entirety.

The present application is related to concurrently filed U.S. patent application Ser. No. ______ filed on ______ entitled “HALO: A Hardware-Agnostic Accelerator Orchestration Software Framework for Heterogeneous Computing Systems,” U.S. patent application Ser. No. ______ filed on ______ entitled “C²MPI: A Hardware-Agnostic Message Passing Interface for Heterogeneous Computing Systems,” and U.S. patent application Ser. No. ______ filed on ______ entitled “A Stand-Alone Accelerator Protocol (SAP) for Heterogeneous Computing Systems,” the disclosures of which are hereby incorporated herein by reference in their entireties.

FIELD OF THE DISCLOSURE

The present disclosure is generally related to heterogeneous computing systems, such as large-scale high-performance computing systems, data center computing systems, or edge computing systems which include specialized hardware.

BACKGROUND

Today's high-performance computing (HPC) systems are largely structured based on traditional central processing units (CPUs) with tightly coupled general-purpose graphics processing units (GPUs, which can be considered domain-specific accelerators). GPUs have a different programming model than CPUs and are only efficient in exploiting spatial parallelism for accelerating high-concurrency algorithms but not the temporal/pipeline parallelism vital to accelerating high-dependency algorithms that are widely used in predictive simulations for computational science. As a result, today's HPC systems still have huge room for improvement in terms of performance and energy efficiency for running complex scientific computing tasks (e.g., many large pieces of legacy HPC codes for predictive simulations are still running on CPUs).

In recent years, a few more accelerator choices for heterogeneous computing systems (e.g., HPC and other large-scale computing systems) have emerged, such as field-programmable gate arrays (FPGAs, which can be considered reconfigurable accelerators) and tensor processing units (TPUs, which can be considered application-specific accelerators). Although these new accelerators offer flexible or customized hardware architectures with excellent capabilities for exploiting temporal/pipeline parallelism efficiently, their adoption in extreme-scale scientific computing is still in its infancy and is expected to be a tortuous process (as was adoption of GPUs) regardless of their superior performance and energy efficiency benefits.

FIG. 1 is a diagram illustrating divergent execution flows of hardware-optimized application codes in existing HPC systems. The fundamental challenge to the adoption of any new accelerators in HPC, such as FPGAs and TPUs, is that each accelerator's programming model, message passing interface, and virtualization stack is developed independently and is specific to the respective hardware architecture. With the lack of clarity in the demarcation between hardware-specific and hardware-agnostic development regions, today's programming models require domain-matter experts (DMEs) and hardware-matter experts (HMEs) to work interdependently to make a significant effort in optimizing hardware-specific codes in order to adopt new accelerator devices in HPC and gain performance benefits. This tangled association is a self-imposed bottleneck from existing programming models that impairs a future in true heterogeneous HPC and severely impacts the velocity of scientific discovery.

SUMMARY

A software-defined board support package (SW-BSP) for stand-alone reconfigurable accelerators is provided. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. A stand-alone accelerator protocol (SAP) allows for a hardware accelerator to be plug-and-playable in a stand-alone fashion (without needing a local central processing unit (CPU) host) and interact with a remote computing system agent for application acceleration across any network infrastructure. The SAP further facilitates a hardware-agnostic accelerator orchestration (HALO) software framework for hardware-agnostic programming with high performance portability and scalability in heterogeneous computing systems, such as high-performance computing (HPC) systems, data center computing systems, and edge computing systems. The SW-BSP provides an implementation of the SAP on reconfigurable accelerators. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware. This environment facilitates dynamic plugin of an accelerator onto the network fabric, which can be auto-discovered and utilized by applications.

An exemplary embodiment provides a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to establish a SW-BSP on a reconfigurable hardware accelerator. The SW-BSP includes a SAP module configured to communicate with a remote computing system agent over a network fabric and initialize local hosting of an application kernel; and a boot loader configured to initialize the SAP module on connection of the reconfigurable hardware accelerator to the network fabric.

Another exemplary embodiment provides a reconfigurable hardware accelerator. The reconfigurable hardware accelerator includes a reconfigurable accelerator processor; and a memory comprising instructions which configure the reconfigurable processor with a SW-BSP; wherein the SW-BSP provides an interface for connection to a remote computing system agent over a network fabric.

Another exemplary embodiment provides a method for providing a SW-BSP on a reconfigurable accelerator. The method includes initializing a SAP module for connecting to a remote computing system agent over a network fabric; receiving, over the network fabric, a request to execute a first computational function; and executing a high level synthesis on the reconfigurable accelerator in response to the request.

Those skilled in the art will appreciate the scope of the present disclosure and realize additional aspects thereof after reading the following detailed description of the preferred embodiments in association with the accompanying drawing figures.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.

FIG. 1 is a diagram illustrating divergent execution flows of hardware-optimized application codes in existing high-performance computing (HPC) systems.

FIG. 2 is a diagram illustrating unified execution flow of hardware-agnostic application codes provided by embodiments described herein.

FIG. 3 is a schematic diagram of an exemplary heterogeneous computing system according to embodiments described herein.

FIG. 4 is a diagram of an exemplary embodiment of a software-defined board support package (SW-BSP).

FIG. 5 is a flow diagram of a method for providing a SW-BSP on a reconfigurable accelerator.

FIG. 6 is a block diagram of a computer system using SW-BSP according to embodiments disclosed herein.

DETAILED DESCRIPTION

The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

It will be understood that when an element such as a layer, region, or substrate is referred to as being “on” or extending “onto” another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly on” or extending “directly onto” another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being “over” or extending “over” another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being “directly over” or extending “directly over” another element, there are no intervening elements present. It will also be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected” or “directly coupled” to another element, there are no intervening elements present.

Relative terms such as “below” or “above” or “upper” or “lower” or “horizontal” or “vertical” may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including” when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

A software-defined board support package (SW-BSP) for stand-alone reconfigurable accelerators is provided. The adoption of emerging accelerators is key to achieving greater scale and performance in heterogeneous computing systems. A stand-alone accelerator protocol (SAP) allows for a hardware accelerator to be plug-and-playable in a stand-alone fashion (without needing a local central processing unit (CPU) host) and interact with a remote computing system agent for application acceleration across any network infrastructure. The SAP further facilitates a hardware-agnostic accelerator orchestration (HALO) software framework for hardware-agnostic programming with high performance portability and scalability in heterogeneous computing systems, such as high-performance computing (HPC) systems, data center computing systems, and edge computing systems. The SW-BSP provides an implementation of the SAP on reconfigurable accelerators. Accordingly, embodiments described herein provide a flexible hardware-agnostic environment that allows application developers to develop high-performance applications without knowledge of the underlying hardware. This environment facilitates dynamic plugin of an accelerator onto the network fabric, which can be auto-discovered and utilized by applications.

I. Introduction

HPC and other large-scale computing system applications have become increasingly complex in recent years. Predictive simulations with increasingly higher spatial and temporal resolutions and ever-growing degrees of freedom are the critical drivers for achieving scientific break-through. The latest advancements in deep learning paired with the next-generation scientific computing applications will inevitably demand orders of magnitude more compute power for future computing infrastructure. In the concluding days of Moore's law, general-purpose solutions will no longer be viable for continuing to meet such an exponential growth in performance that is required to keep pace with scientific innovations. This disclosure envisions that extreme-scale heterogeneous computing systems (e.g., HPC systems, data center computing systems, edge computing systems) that massively integrate various domain- and application-specific accelerators will be a viable blueprint for providing the necessary performance and energy efficiency to meet the challenges of future applications.

However, as described with respect to FIG. 1, the path to realizing extreme-scale heterogeneous computing systems is tortuous. The main obstacle to the proliferation of heterogeneous accelerators is the lack of a flexible hardware-agnostic programming model that separates the hardware-specific and performance-critical portion of an application from its logic flow, which causes the divergent execution flows between domain matter experts (DMEs) and hardware matter experts (HMEs). As a result, HPC and other large-scale computing system applications are by no means extensible with regard to new accelerator hardware.

This disclosure envisions that hardware-agnostic programming with high-performance portability will be the bedrock for realizing the pervasive adoption of emerging accelerator technologies in future heterogeneous computing systems. The proposed approach includes hardware-agnostic programming paired with a programming model that enables application developers, scientists, and other users to focus on conditioning and steering data to and from hardware-specific code without any assumption of the underlying hardware. Data conditioning and steering refers to the reorganization and movement of data within an application.

Performance portability is defined in the strictest sense as the ability for the host code to maintain a single hardware-agnostic control flow, as well as state-of-the-art kernel performance, regardless of platform and/or scale. Additionally, performance portability includes the ability to dynamically handle various accelerators without recompilation of the host code. This is in stark contrast to the current definition that allows for multiple control flows and recompilation processes.

FIG. 2 is a diagram illustrating unified execution flow of hardware-agnostic application codes provided by embodiments described herein.

Embodiments described herein deploy a set of hardware-agnostic principles to enable host code hardware-agnostic programming with true performance portability in the context of heterogeneous computing systems. The proposed hardware-agnostic principles impose a clear demarcation between hardware-specific and hardware-agnostic software development to allow DMEs and HMEs to work independently in completely decoupled development regions to significantly improve practicability, productivity, and efficiency.

To accomplish this, DMEs are restricted to conditioning and steering (orchestrating) data in and out of functional abstractions of hardware-optimized kernels. The hardware-agnostic abstraction of kernels in this regard can be defined by a label and its inputs, outputs, and state variables. Such a functional approach to hardware-agnostic programming is the key to the clear division between the responsibility of DMEs and HMEs. As a result, HMEs will focus on optimizing hardware-specific kernel implementations in their optimal programming environments while being able to eliminate the adoption barrier by leveraging a hardware-agnostic environment via a unified hardware-agnostic accelerator interface. Furthermore, DMEs will focus on application or algorithm development while being able to maintain a single code flow and effortlessly reap the performance benefits of new hardware accelerators by leveraging the hardware-agnostic environment via a unified hardware-agnostic application interface.
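As a minimal, non-limiting sketch, the functional abstraction of a kernel described above can be pictured as a record holding a label together with its inputs, outputs, and state variables. The Python names below (`KernelAbstraction`, `describe`, and the example `matmul` kernel) are hypothetical and are introduced only for illustration; they are not defined by this disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KernelAbstraction:
    """Hypothetical hardware-agnostic description of a kernel."""
    label: str                      # name a DME uses to request the kernel
    inputs: Tuple[str, ...] = ()    # names of input buffers
    outputs: Tuple[str, ...] = ()   # names of output buffers
    state: Tuple[str, ...] = ()     # names of state variables kept between calls

    def describe(self) -> str:
        """Render the abstraction without referring to any hardware."""
        ins = ", ".join(self.inputs)
        outs = ", ".join(self.outputs)
        return f"{self.label}({ins}) -> ({outs})"


# Example: an assumed matrix-multiply kernel with two inputs and one output.
matmul = KernelAbstraction(label="matmul", inputs=("a", "b"), outputs=("c",))
summary = matmul.describe()
```

Nothing in this record names a vendor, device, or toolchain, which is the property that lets DMEs and HMEs work against the same description independently.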

A. Heterogeneous Computing System

FIG. 3 is a schematic diagram of an exemplary heterogeneous computing system 10 according to embodiments described herein. The heterogeneous computing system 10 provides hardware-agnostic system environments through a C²MPI interface 12 with an application 14, a HALO framework 16, a SAP 18, and a software-defined board support package (SW-BSP) 20 for reconfigurable accelerators. These technologies provide a multi-stack software framework that includes stacks of core system agents with a modular architecture for hardware-agnostic programming and transparent execution with performance portability, interoperability, scalability, and resiliency across an extreme-scale heterogeneous computing system.

The communication message passing interface of various accelerators is unified for the heterogeneous computing system 10 by implementing a novel C²MPI standard described herein. The C²MPI interface 12 extends concepts from the existing message passing interface (MPI) standard and introduces new concepts of computation-centric communication to support interoperable program execution and communication across meshes of heterogeneous accelerators by leveraging remote procedure calls (RPCs). In C²MPI, a CPU and an accelerator process are treated uniformly by attaching a computation function attribute to each process to allow run-time programs to easily distribute data to various function-specific accelerator processes for highly scalable acceleration.

The HALO framework 16 is provided as an open-ended extensible multi-agent software framework that implements the proposed hardware-agnostic principles and C²MPI specification for enabling the portable and performance-optimized execution of hardware-agnostic application codes across heterogeneous computing devices. Dual-agent embodiments of the HALO framework 16 include two system agents, i.e., a runtime agent and a virtualization agent, which work asynchronously in a star topology. The runtime agent is responsible for implementing and offering the C²MPI interface 12, as well as being the crossbar switch for application processes and virtualization agents. The runtime agent also manages system resources, including device buffers, accelerator manifests, kernels, etc. The virtualization agent provides an asynchronous peer that encapsulates hardware-specific compilers, libraries, runtimes, and drivers. The runtime and virtualization agents implement common inter-process communication (IPC) channels for interoperability between multiple virtualization agents, which allows HALO to scale the number of accelerator types supported while maintaining the simplicity and structure of the framework.

Multi-agent embodiments of the HALO framework 16 consist of a set of core system agents (i.e., runtime agent, bridge agent, accelerator agent, virtualization agent) implementing a plug-and-play architecture for the purpose of scale and resiliency. The runtime agent and virtualization agent operate similarly to those of dual-agent embodiments. The purpose of the bridge agent is to interoperate between the CPU and accelerator domains. The primary responsibility of the accelerator agent is to interconnect the entire accelerator domain and provide interoperability among heterogeneous accelerators across multiple nodes.

SAP 18 provides a new architectural standard for scalable stand-alone accelerators, facilitating implementation of large clusters of stand-alone accelerators via a network 22. Reconfigurable accelerators can implement the SAP using SW-BSP 20.

B. Definitions

The following terms are defined for clarity:

Accelerator: A computing system or hardware device that is programmed, configured, or designed to perform a certain computation routine. Programmed CPUs, programmed graphical processing units (GPUs), configured field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs) are examples of accelerators.

Standalone accelerator: An accelerator that does not require a local host but can be hosted remotely through a network connection.

Accelerator node (AN): A CPU-based server node that locally hosts one or more accelerators.

Platform node (PN): A CPU-based server node that handles the steering, scaling, security, resilience, and other traditional aspects of the virtualization stack.

Standalone accelerator protocol (SAP): An architectural standard for scalable standalone accelerators.

Independent accelerator integrated circuit (IAIC): An ASIC/FPGA that implements the SAP.

II. HALO Principles

HALO principles are the principles to keep in mind when developing hardware-agnostic specifications, programming models, and frameworks. The hallmark of a hardware-agnostic system is an interface definition devoid of any vendor-specific, hardware-specific, or computational-task-specific implementations or naming conventions. Interfaces must also be domain-agnostic such that method signatures imply a delivery vehicle rather than functionality. For instance, a method called “execute(kernel1, parameter1 . . . N)” is domain-agnostic; however, “kernel1(parameter1 . . . N)” is not. Additionally, hardware-agnostic and hardware-specific regions must be clearly defined and decoupled with a robust interoperation protocol.

Lastly, abstract functionality must be inclusive of procedures that operate on data and change state. Being domain-agnostic will allow for the enormous flexibility and extensibility required to maintain an open-ended HALO software architecture, where HMEs can easily extend the overall system with new accelerator devices and kernel implementations. The purposes of a hardware-agnostic programming model are twofold. The first is to minimize the amount of hardware-dependent software in a codebase while maximizing the portability of the host code across heterogeneous computing devices. The second is to clearly separate the functional (non-performance-critical) and computational (performance-critical) aspects of the application to simplify the adoption of new accelerator hardware as well as the development and integration of hardware-specific and hardware-optimized interfaces/kernels.
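The difference between a domain-agnostic signature and a domain-specific one may be easier to see in code. The following Python sketch is illustrative only: the registry `_KERNELS` and the functions `register` and `execute` are hypothetical stand-ins, not the C²MPI or HALO interface, and the lambda stands in for a hardware-optimized kernel.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a kernel label to a hardware-specific implementation.
_KERNELS: Dict[str, Callable[..., Any]] = {}


def register(label: str, impl: Callable[..., Any]) -> None:
    """Hardware-specific side: bind a label to an implementation."""
    _KERNELS[label] = impl


def execute(label: str, *params: Any) -> Any:
    """Hardware-agnostic side: the signature is only a delivery vehicle.

    Nothing in the name or parameters reveals what the kernel computes
    or which device runs it.
    """
    try:
        impl = _KERNELS[label]
    except KeyError:
        raise LookupError(f"no implementation registered for {label!r}") from None
    return impl(*params)


# A plain Python function stands in for an accelerator-backed kernel.
register("kernel1", lambda x, y: x + y)
result = execute("kernel1", 2, 3)
```

Swapping the implementation behind "kernel1" requires no change to the caller, whereas a direct call such as kernel1(2, 3) would tie the host code to one specific function.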

III. SAP Design

A. Overview

The SAP is a protocol that defines functionality for a fused computation-communication model for interoperability of accelerators (including CPUs, FPGAs, digital signal processors (DSPs), ASICs, GPUs, tensor processing units (TPUs)) across 1 to n ANs and PNs. The main motivating factor for this specification is to provide for a new breed of subroutine-based integrated circuits (ICs) called IAICs. These ICs may be reconfigurable, but at their heart they have a list of supported subroutines, either on-chip, on-board, or on-network. The subroutines can be dynamically loaded, or immutable and statically loaded at power-on. A subroutine discovery process allows the HALO framework to dynamically discover which hardware provides which subroutine. The discovery process is triggered at connection time. ICs in this class should have a minimum of the functionality described below.

B. SAP Concepts

SAP may have one or more of the following minimum capabilities. The below capabilities are communicated in a hardware-agnostic manner:

SAP may provide facilities to support hot-plug capabilities. Embodiments of SAP-compliant ICs are discoverable on any type of interconnect fabric dynamically, including Peripheral Component Interconnect Express (PCIe), Transmission Control Protocol/Internet Protocol (TCP/IP), InfiniBand (IB), Universal Serial Bus (USB), etc.

SAP may provide facilities to support dynamic auto-discovery on the interconnect fabric. This facility allows for the accelerator to be discovered dynamically during connection time. Connection time refers to the moment where the accelerator is plugged into a communication fabric. Examples include but are not limited to USB, PCIe, IB, TCP/IP, etc.

SAP may provide subroutine indexing capabilities. Subroutine indexing is a capability where the accelerators uniquely index runtime available subroutines. This is how a HALO framework discovers which subroutines are available on accelerators.
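As a non-limiting sketch of subroutine indexing, a framework could collect each accelerator's index of available subroutines and invert it into a lookup from subroutine label to the accelerators that offer it. The function `build_subroutine_map`, the manifest layout, and the accelerator names are assumptions made for illustration; the source does not specify a data format.

```python
from collections import defaultdict
from typing import Dict, List


def build_subroutine_map(manifests: Dict[str, Dict[int, str]]) -> Dict[str, List[str]]:
    """Invert per-accelerator indexes into a label -> accelerators lookup.

    `manifests` maps an accelerator identifier to that accelerator's
    own index of subroutines (index number -> subroutine label).
    """
    available = defaultdict(list)
    for accel_id in sorted(manifests):
        for _index, label in sorted(manifests[accel_id].items()):
            available[label].append(accel_id)
    return dict(available)


# Assumed manifests reported by two accelerators at connection time.
manifests = {
    "fpga0": {0: "fft", 1: "matmul"},
    "fpga1": {0: "matmul"},
}
subroutine_map = build_subroutine_map(manifests)
```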

SAP may provide read-only configuration registers for conveying SAP specification feature implementation. The purpose of these registers is to convey to a HALO framework what features each IAIC has implemented. This functionality may include (but is not limited to): static/dynamic subroutines, reprogrammability, memory capacity, partitioning, virtualization levels, etc. The configuration registers can mainly be implemented as a bitmask describing the optional concepts that the IAIC has implemented.
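One way to picture a bitmask-style configuration register is with one bit per optional feature. The bit positions and the names `SapFeature` and `decode_features` in this Python sketch are hypothetical; the source does not define a register layout.

```python
from enum import IntFlag


class SapFeature(IntFlag):
    """Hypothetical bit assignments for optional SAP features."""
    DYNAMIC_SUBROUTINES = 1 << 0
    REPROGRAMMABLE = 1 << 1
    PARTITIONING = 1 << 2
    VIRTUALIZATION = 1 << 3
    SUBROUTINE_CACHE = 1 << 4


def decode_features(register_value: int) -> SapFeature:
    """Interpret a raw read-only register value as a set of feature flags."""
    return SapFeature(register_value)


# Assumed register value with bits 0, 1, and 4 set.
features = decode_features(0b10011)
has_cache = SapFeature.SUBROUTINE_CACHE in features
```

A framework reading such a register could, for example, skip resending a subroutine binary when the cache bit is set.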

SAP may provide queue, handshaking, memory allocation, and flow control facilities for input and output data. This is the main facility for providing RPC functionality. The accelerator may be able to accept a request at a pending status (without acknowledgment) if and only if there is enough space to hold the request (not the data) on the fabric or on the IAIC. For example, with a globally unique identifier (GUID), a request could hold 32-bit source and destination addresses, input memory requirements, output memory requirements, etc. The handshaking capabilities provide synchronization capabilities for a request to move from “pending” to “ready.” To be in the ready status, the IAIC may return an acknowledgment if and only if the IAIC was successfully able to allocate memory for the input and output data.
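The two admission conditions described above (queue space for the request itself, then memory for its input and output data) can be sketched as a toy model. The classes `Request` and `Iaic`, their fields, and the slot and byte counts are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    guid: str
    input_bytes: int    # input memory requirement
    output_bytes: int   # output memory requirement
    status: str = "new"


class Iaic:
    """Toy model of request admission and handshaking on an IAIC."""

    def __init__(self, queue_slots: int, memory_bytes: int) -> None:
        self.free_slots = queue_slots
        self.free_memory = memory_bytes

    def submit(self, req: Request) -> bool:
        """Accept the request as pending only if a queue slot is free."""
        if self.free_slots == 0:
            return False
        self.free_slots -= 1
        req.status = "pending"
        return True

    def handshake(self, req: Request) -> bool:
        """Acknowledge (pending -> ready) only if both buffers can be allocated."""
        needed = req.input_bytes + req.output_bytes
        if req.status != "pending" or needed > self.free_memory:
            return False
        self.free_memory -= needed
        req.status = "ready"
        return True


iaic = Iaic(queue_slots=2, memory_bytes=1024)
small = Request("req-1", input_bytes=256, output_bytes=256)
large = Request("req-2", input_bytes=512, output_bytes=512)
accepted = [iaic.submit(small), iaic.submit(large)]
acks = [iaic.handshake(small), iaic.handshake(large)]
```

In this run both requests are queued, but only the first is acknowledged; the second stays pending because the remaining memory cannot hold its buffers.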

SAP may provide facilities to query RPC state, and OPTIONALLY, timers, occupancy, and performance on a per-job or time-frame basis. The purpose of this is to be able to track the state of the IAIC with regard to load. The HALO framework optimizes based on performance and is able to reroute requests on-the-fly. For this to be possible, the IAIC must be able to convey performance data.

SAP providing dynamic reconfiguration based on an application binary interface (ABI) may provide the capability to interpret subroutine definition files. This functionality facilitates interoperability. A complete standalone accelerator that can be reconfigurable must be able to read the binaries that were generated from their respective toolchains. This implies that the accelerator must be able to accept a binary file with any ABI and interpret whether it is a subroutine file belonging to that resource. If the binary belongs to it, it may claim ownership in a shared state (SS). If it does not recognize the binary, it may claim a not applicable (NA) state.
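As a simplified, non-limiting sketch of the ownership decision, an accelerator could inspect the leading bytes of a received binary and claim it only when they match a format its own toolchain produces. The four-byte signature check, the function `claim_binary`, and the signature value `b"ACC1"` are hypothetical; the source does not describe how a binary is recognized.

```python
SHARED_STATE = "SS"       # the accelerator claims ownership of the binary
NOT_APPLICABLE = "NA"     # the accelerator does not recognize the binary


def claim_binary(binary: bytes, recognized_signatures: frozenset) -> str:
    """Return SS if the binary starts with a recognized 4-byte signature, else NA."""
    if binary[:4] in recognized_signatures:
        return SHARED_STATE
    return NOT_APPLICABLE


# Assumed signature emitted by this accelerator's toolchain.
signatures = frozenset({b"ACC1"})
own = claim_binary(b"ACC1" + b"\x00" * 16, signatures)
foreign = claim_binary(b"XXXX" + b"\x00" * 16, signatures)
```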

SAP may have the capability to retain the subroutine binaries on- or off-chip. The SAP has the option to cache or save the binary file on-chip for quicker access. If so, this functionality should be conveyed in the configuration registers to let the HALO framework know that it has caching capabilities and should forgo needlessly sending files across the network. The HALO framework may reference the cache line where the subroutine lives along with the payload. Static SAP, or SAP that is not subroutine reconfigurable, automatically forgoes this process and does not take part in the subroutine discovery process.

SAP may provide internal virtualization through flow control and memory space. The IAIC may not expose any virtualization primitives or logic externally. The virtualization may be masked behind a flow control mechanism.

SAP may provide facilities to partition, sequence, and reassemble data per the interconnect specification. During the “ready” state, the HALO framework sends data in packets. After receiving a packet, an RPC status of an IAIC goes from “ready” to “receiving,” indicating that packets have started to come in. After all the input data has been received and reassembled, the RPC status goes from “receiving” to “executing.” After the subroutine has completed its execution, the RPC status moves from “executing” to “ex_complete.” When an execution is complete, the “ex_complete” status moves to “sending,” and finally, “sd_complete.” At “sd_complete,” the buffers are deallocated and added back to the available memory pool.
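The RPC status progression described above can be sketched as a linear state machine. The status names are taken from the description; the table `TRANSITIONS` and the class `RpcStatus` are hypothetical names, and the sketch omits the cancel, pause, and resume controls that SAP may also provide.

```python
# Each status and its single successor, as described above.
TRANSITIONS = {
    "ready": "receiving",
    "receiving": "executing",
    "executing": "ex_complete",
    "ex_complete": "sending",
    "sending": "sd_complete",
}


class RpcStatus:
    """Tracks one RPC from the ready status through sd_complete."""

    def __init__(self) -> None:
        self.state = "ready"
        self.history = ["ready"]

    def advance(self) -> str:
        """Move to the next status; sd_complete is terminal."""
        if self.state not in TRANSITIONS:
            raise RuntimeError(f"{self.state!r} is a terminal status")
        self.state = TRANSITIONS[self.state]
        self.history.append(self.state)
        return self.state


rpc = RpcStatus()
while rpc.state != "sd_complete":
    rpc.advance()
```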

SAP may provide subroutine definitions in a standard fashion as described by the HALO specification.

SAP may provide facilities for canceling, interrupting, pausing, and resuming an RPC. This functionality facilitates more intelligent quality of service (QoS). In case the HALO framework needs to context switch to another RPC, the SAP should take the proper steps to save the state of the RPC and resume it when appropriate. The command set for controlling an RPC includes the above commands as well as hibernating a request, removing a request, and deprioritizing requests during any state of the RPC. Exposing this functionality allows global optimization at scale. It is at the discretion of each SAP implementation whether to leverage the functionality internally.

SAP may provide memory isolation for each RPC for security reasons.

SAP may provide a series of fault registers for unauthorized access to memory.

SAP may provide the capability to receive data from at least ONE source and send it to at least ONE destination. The sending may deliver the information to a forwarding address rather than returning it to the source. In some embodiments, the sending may both return the information to the source and to the forwarding address.

SAP may provide configurable heartbeat (or watchdog timers) capabilities. This functionality ensures that the hardware is still available for provisioning.
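A configurable heartbeat can be sketched as a timeout check against the time of the last beat received. The class `Watchdog`, its methods, and the five-second timeout are hypothetical; timestamps are passed in explicitly so the behavior is deterministic.

```python
from typing import Optional


class Watchdog:
    """Marks an accelerator unavailable if no heartbeat arrives within the timeout."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s               # configurable interval
        self.last_beat: Optional[float] = None   # time of the most recent heartbeat

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (in seconds)."""
        self.last_beat = now

    def is_alive(self, now: float) -> bool:
        """True if a heartbeat was seen within the last `timeout_s` seconds."""
        if self.last_beat is None:
            return False
        return (now - self.last_beat) <= self.timeout_s


watchdog = Watchdog(timeout_s=5.0)
before_first_beat = watchdog.is_alive(now=0.0)
watchdog.beat(now=10.0)
within_timeout = watchdog.is_alive(now=14.0)
after_timeout = watchdog.is_alive(now=16.0)
```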

SAP may OPTIONALLY provide facilities to retain a list of senders and receivers for complex data aggregation schemes. This system is modeled by a many-to-many input meaning that an RPC can obtain its data from many sources and output the data to many destinations. It is important that the SAP have the ability to send and receive from multiple locations. The configuration registers may indicate how much capacity for sending and receiving addresses can be supported per work request.

SAP may OPTIONALLY provide the capability to update the destination list and retransmit the output data. The purpose of this functionality is to be able to override the destination to send to a new set of destinations.

SAP may OPTIONALLY provide the facilities to maintain a memory allocation until deallocation is requested. The purpose of this functionality is to allow the output data to be retransmitted to a new set of destination addresses.
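The two optional capabilities above (updating the destination list and retaining the allocation until deallocation is requested) work together, as the following sketch suggests. The `RetainedOutput` class and its `send` callback are hypothetical stand-ins for the SAP transmit path.

```python
class RetainedOutput:
    """Keeps an RPC's output buffer alive so it can be sent again."""

    def __init__(self, payload, destinations, send):
        self.payload = payload
        self.destinations = list(destinations)
        self._send = send          # hypothetical callback: send(address, payload)
        self.released = False

    def transmit(self):
        if self.released:
            raise RuntimeError("output buffer has been deallocated")
        for address in self.destinations:
            self._send(address, self.payload)

    def retarget(self, new_destinations):
        """Override the destination list and retransmit the same output."""
        self.destinations = list(new_destinations)
        self.transmit()

    def deallocate(self):
        """Release the buffer; no further retransmission is possible."""
        self.payload = None
        self.released = True
```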

SAP may OPTIONALLY have the capability to leverage any hardware-specific registers or functionality. This has limited capabilities. The hardware-specific information that can be conveyed is limited to information that was encoded into the binary of the subroutine definition files as well as key-value pairs coming from the framework if and only if the configuration files are targeting specific hardware. If the configuration file has generic definitions, hardware-specific information is not passed through.

SAP may OPTIONALLY have the capability to provide SAP-specific error codes and messages particular to the hardware. Besides the required states, error codes, and messages, the IAIC developer can define more descriptive error codes, messages, and states. However, the system may treat these errors as unknown errors and pass them through as additional data.

IV. SW-BSP Design

A. Introduction

Currently, FPGAs and other domain-specific accelerators often require the presence of a host, typically a CPU-based system strongly coupled via PCIe. This has a considerable impact on scaling up and out with regard to power and performance. In order to remove the bottleneck of a host CPU, accelerators must be able to stand alone on the network. Furthermore, stand-alone accelerators raise new perspectives on old networking questions such as topology, clustering, congestion, plug-and-play, and network virtualization. Most importantly, the regularization of discovery, resource management, dispatch, reliability, redundancy, and statusing must be revisited, since domain-specific accelerators at the cluster level are relatively new and current accelerator development toolchains lack the velocity to production.

With current toolchains, it takes an impractical amount of time to develop hardware for performance-critical systems. The high-level synthesis (HLS) toolchain for FPGAs greatly reduces the time it takes to develop hardware whose performance rivals that of ASICs. However, this toolchain is almost always relegated to application-level kernels, with the board support package (BSP) remaining in the traditional development domain described via hardware description language (HDL). In summary, to realize the early research into stand-alone FPGAs (and by proxy all ASICs) at cluster scale, there needs to be a BSP that is as flexible to modify as CPU software.

Currently, there are no domain-specific accelerator clusters at the scale of current HPC clusters. The backbone of current heterogeneous clusters consists of nodes with CPUs strongly coupled primarily with GPUs, FPGAs, and ASICs, with piecemeal software that lacks interoperability and scaling. With regard to FPGA HLS development flows, current BSPs are written entirely in RTL, putting customization of the BSP outside the reach of many software programmers. This makes it very difficult to update and customize board-level functionality.

B. Architectural Overview

FIG. 4 is a diagram of an exemplary embodiment of the SW-BSP 20. The purpose of the SW-BSP 20 is to minimize the development of RTL by migrating components of reconfigurable accelerators to HLS, improving development time and complexity by the orders of magnitude necessary to develop and deploy DSAs at a practical level. The SW-BSP 20 provides a middle layer to loosely couple traditional BSP functions and application kernels. In particular, the SW-BSP 20 uses this layer to implement the components necessary to enable stand-alone FPGA features, including SAP, resource management, dispatch and statusing, and monitoring logic. The SW-BSP 20 also includes a modified board support region that incorporates a minimal set of components necessary for a BSP to function with existing flows and toolchains with minor modifications. The purpose of this architecture is to create an FPGA BSP (traditional BSP+SW-BSP 20) that can be used to enable accelerators to exist on the network without the need for a host processor.

The SW-BSP 20 includes a SAP module 26, a boot loader 28, an application module 30, and a board support module 32. Starting with the board support module 32, data flows in and out from an interface with the network 24 (PHY Channels 0,1). The illustrated embodiment has two channels, but this can be extended to any number N of channels. The board support module 32 further includes a detector/dissector submodule that interprets packet headers and is used by a selector submodule to decide the sinking component, which includes components connected to the register map, partial reconfiguration, and channel divider. The channel divider interfaces with the boot loader 28 and SAP module 26 through input/output (I/O) channels.
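The detector/dissector and selector path described above can be sketched as follows. The packet layout is an assumption made purely for illustration: a one-byte type tag followed by the payload, with the tag values `0x01` and `0x02` chosen arbitrarily. The sketch also assumes that any packet not addressed to the register map or partial reconfiguration logic is handed to the channel divider.

```python
from enum import Enum


class Sink(Enum):
    REGISTER_MAP = "register_map"
    PARTIAL_RECONFIG = "partial_reconfiguration"
    CHANNEL_DIVIDER = "channel_divider"


# Hypothetical one-byte type tags; the real header format is not specified here.
TAG_TO_SINK = {
    0x01: Sink.REGISTER_MAP,
    0x02: Sink.PARTIAL_RECONFIG,
}


def dissect(packet: bytes):
    """Detector/dissector: split a packet into its header tag and payload."""
    if not packet:
        raise ValueError("empty packet")
    return packet[0], packet[1:]


def select_sink(packet: bytes):
    """Selector: choose the sinking component based on the dissected header."""
    tag, payload = dissect(packet)
    # Assumption: anything else goes to the channel divider, which feeds
    # the boot loader and the SAP module.
    return TAG_TO_SINK.get(tag, Sink.CHANNEL_DIVIDER), payload
```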

The SAP module 26 and the boot loader module 28 are two major components in the SW-BSP 20. The boot loader module 28 initializes the SAP module 26 during device initialization. This is necessary to be compatible with existing toolchains. The SAP module 26 implements an Open Systems Interconnect (OSI) model for connection to a remote computing system agent (e.g., HALO 16). The application module 30 is initialized by the SAP module 26 and provides local hosting of one or more application kernels 34. Thus, the SW-BSP 20 enables developers to develop the application kernels 34 as application-specific computational functionality (e.g., matrix multiplication, linear solvers, machine learning models) using the HLS toolchain.

C. Functionality of the SW-BSP

The board support module 32 implements submodules for memory, kernel control, power, and partial reconfiguration. The board support module 32 further contains the minimal features that are required to enable the SAP. The board support module 32 is fully operational during a partial reconfiguration cycle, and has the ability to partially reconfigure the SAP module 26, the application module 30, and application kernels 34 dynamically during runtime.

The board support module 32 has the ability to arbitrate between multiple channels of PHY I/O. The board support module 32 is further able to interpret SAP packets, as well as forward other packets to the application kernel channels (VCHAN0 and VCHAN1). The board support module 32 is able to write to the register map independently of the application module 30.

The SW-BSP 20 uses a higher-level language (e.g., corresponding to the HLS toolchain) to implement all modules. The boot loader module 28 initializes the other SW-BSP 20 components. The SW-BSP 20 decouples the board support module 32 from a hosted application by inserting HLS interfaces (e.g., at the application module 30). The SAP module 26 interfaces with the board support module 32 and the application kernels 34 (e.g., via the application module 30).

The SAP module 26 implements the SAP protocol described above. The SAP module 26 controls the responses for resource management including memory, dispatch, discovery, and kernel identification. The SAP module 26 forwards all unrecognizable packets to the application channels.
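A rough sketch of this behavior is shown below: recognized resource management requests are answered directly, and anything else is forwarded to the application channels. The message types ("discovery", "memory_query") and reply fields are hypothetical placeholders and are not taken from the SAP specification.

```python
class SapModule:
    def __init__(self, free_memory_bytes, kernel_ids):
        self.free_memory_bytes = free_memory_bytes
        self.kernel_ids = list(kernel_ids)
        # Stand-in for the application channels; a list keeps the sketch simple.
        self.application_channel = []

    def handle(self, packet):
        """Answer recognized requests; forward everything else unchanged."""
        kind = packet.get("type")
        if kind == "discovery":       # hypothetical message type
            return {"type": "discovery_reply", "kernels": list(self.kernel_ids)}
        if kind == "memory_query":    # hypothetical message type
            return {"type": "memory_reply", "free_bytes": self.free_memory_bytes}
        # Unrecognizable packet: pass it on to the application channels.
        self.application_channel.append(packet)
        return None
```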

The application module 30 keeps track of owners of kernel executions in the form of source requests and forwards results after completion back to the source (e.g., via the SAP module 26). The SW-BSP application module 30 further monitors and updates the host program with the state of the reconfigurable accelerator (e.g., FPGA). The SAP module 26 is able to respond to network packets at each layer.

The application module 30 is able to be dynamically reconfigured without disturbing the board support module 32. The SW-BSP application module 30 is further able to start, stop, and monitor application kernels. The application module 30 is able to initiate a partial reconfiguration cycle.

The SW-BSP 20 does not contain any domain-specific functionality, and functions independently of the kernel applications.

V. Flow Diagram

FIG. 5 is a flow diagram of a method for providing a SW-BSP on a reconfigurable accelerator. Dashed boxes represent optional steps. The process begins at operation 500, with initializing a SAP module for connecting to a remote computing system agent over a network fabric. The process optionally continues at operation 502, with, on connection to the network fabric, registering with the remote computing system agent. In some examples, if the reconfigurable accelerator is powered on from an off state, it will first need to register with the remote computing system agent so that the remote computing system agent can discover it as a newly available SAP-enabled accelerator on the network.

The process continues at operation 504, with receiving, over the network fabric, a request to execute a computational function. The process continues at operation 506 with passing the request to an application module. The process optionally continues at operation 508, with, at the application module, selecting an application kernel suitable for executing the computational function. The process optionally continues at operation 510, with reconfiguring the reconfigurable accelerator to host the selected application kernel. In cases when the reconfigurable accelerator is already programmed with the selected application kernel (e.g., due to a previous request), the reconfiguration at operation 510 is bypassed.

The process optionally continues at operation 512, with executing the computational function. The process optionally continues at operation 514, with initializing a host control for the selected application kernel to monitor a status of the computational function. The process optionally continues at operation 516, with causing the SAP module to provide an indication of available resources to the remote computing system agent.
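The request-handling portion of this flow (operations 504 through 512) can be sketched in a few lines. The `Accelerator` class, its kernel table of Python callables, and the reconfiguration counter are hypothetical stand-ins for the application module and partial reconfiguration logic; the sketch only illustrates that reconfiguration is bypassed when the selected kernel is already loaded.

```python
class Accelerator:
    def __init__(self, kernels):
        self.kernels = kernels        # function name -> callable (stand-in for kernels)
        self.loaded = None            # name of the kernel currently programmed
        self.reconfigurations = 0     # counts simulated reconfiguration cycles

    def handle_request(self, function_name, *args):
        # Operation 508: select a kernel suitable for the requested function.
        if function_name not in self.kernels:
            raise KeyError(f"no kernel for {function_name!r}")
        # Operation 510: reconfigure only if that kernel is not already loaded.
        if self.loaded != function_name:
            self.loaded = function_name
            self.reconfigurations += 1
        # Operation 512: execute the computational function.
        return self.kernels[function_name](*args)
```

For example, two consecutive requests for the same function trigger a single reconfiguration in this sketch, while a request for a different function triggers another.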

Although the operations of FIG. 5 are illustrated in a series, this is for illustrative purposes and the operations are not necessarily order dependent. Some operations may be performed in a different order than that presented. Further, processes within the scope of this disclosure may include fewer or more steps than those illustrated in FIG. 5.

VI. Computer System

FIG. 5 is a block diagram of a computer system 500 using SW-BSP according to embodiments disclosed herein. The computer system 500 can be implemented as a reconfigurable accelerator in communication with a heterogeneous computing system. The computer system 500 comprises any computing or electronic device capable of including firmware, hardware, and/or executing software instructions that could be used to perform any of the methods or functions described above, such as providing SAP. In this regard, the computer system 500 may be a circuit or circuits included in an electronic board card, such as a printed circuit board (PCB), a server, a personal computer, a desktop computer, a laptop computer, an array of computers, a personal digital assistant (PDA), a computing pad, a mobile device, or any other device, and may represent, for example, a server or a user's computer.

The exemplary computer system 500 in this embodiment includes a processing device 502 or processor, a system memory 504, and a system bus 506. The system memory 504 may include non-volatile memory 508 and volatile memory 510. The non-volatile memory 508 may include read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The volatile memory 510 generally includes random-access memory (RAM) (e.g., dynamic random-access memory (DRAM), such as synchronous DRAM (SDRAM)). A basic input/output system (BIOS) 512 may be stored in the non-volatile memory 508 and can include the basic routines that help to transfer information between elements within the computer system 500.

The system bus 506 provides an interface for system components including, but not limited to, the system memory 504 and the processing device 502. The system bus 506 may be any of several types of bus structures that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and/or a local bus using any of a variety of commercially available bus architectures.

The processing device 502 represents one or more commercially available or proprietary general-purpose processing devices, such as a microprocessor, CPU, or the like. More particularly, the processing device 502 may be a complex instruction set computing (CISC) microprocessor, a reduced instruction set computing (RISC) microprocessor, a very long instruction word (VLIW) microprocessor, a processor implementing other instruction sets, or other processors implementing a combination of instruction sets. Examples of the processing device 502 may include a Host CPU node, a CPU cluster, an FPGA or FPGA cluster, GPU or GPU cluster, or a TPU or TPU cluster. The processing device 502 may also be an application-specific integrated circuit (ASIC), for example. The processing device 502 is configured to execute processing logic instructions for performing the operations and steps discussed herein.

In this regard, the various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with the processing device 502, which may be a microprocessor, FPGA, a digital signal processor (DSP), an ASIC, or other programmable logic device, a discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Furthermore, the processing device 502 may be a microprocessor, or may be any conventional processor, controller, microcontroller, or state machine. The processing device 502 may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).

The computer system 500 may further include or be coupled to a non-transitory computer-readable storage medium, such as a storage device 514, which may represent an internal or external hard disk drive (HDD), flash memory, or the like. The storage device 514 and other drives associated with computer-readable media and computer-usable media may provide non-volatile storage of data, data structures, computer-executable instructions, and the like. Although the description of computer-readable media above refers to an HDD, it should be appreciated that other types of media that are readable by a computer, such as optical disks, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the operating environment, and, further, that any such media may contain computer-executable instructions for performing novel methods of the disclosed embodiments.

An operating system 516 and any number of program modules 518 or other applications can be stored in the volatile memory 510, wherein the program modules 518 represent a wide array of computer-executable instructions corresponding to programs, applications, functions, and the like that may implement the functionality described herein in whole or in part, such as through instructions 520 on the processing device 502. The program modules 518 may also reside on the storage mechanism provided by the storage device 514. As such, all or a portion of the functionality described herein may be implemented as a computer program product stored on a transitory or non-transitory computer-usable or computer-readable storage medium, such as the storage device 514, non-volatile memory 508, volatile memory 510, instructions 520, and the like. The computer program product includes complex programming instructions, such as complex computer-readable program code, to cause the processing device 502 to carry out the steps necessary to implement the functions described herein.

An operator, such as the user, may also be able to enter one or more configuration commands to the computer system 500 through a keyboard, a pointing device such as a mouse, or a touch-sensitive surface, such as the display device, via an input device interface 522 or remotely through a web interface, terminal program, or the like via a communication interface 524. The communication interface 524 may be wired or wireless and facilitate communications with any number of devices via a communications network in a direct or indirect fashion. An output device, such as a display device, can be coupled to the system bus 506 and driven by a video port 526. Additional inputs and outputs to the computer system 500 may be provided through the system bus 506 as appropriate to implement embodiments described herein.

The operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined.

Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow. 

What is claimed is:
 1. A non-transitory computer-readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to establish a software-defined board support package (SW-BSP) on a reconfigurable hardware accelerator, the SW-BSP comprising: a stand-alone accelerator protocol (SAP) module configured to communicate with a remote computing system agent over a network fabric and initialize local hosting of an application kernel; and a boot loader configured to initialize the SAP module on connection of the reconfigurable hardware accelerator to the network fabric.
 2. The non-transitory computer-readable medium of claim 1, wherein the SAP module implements an Open Systems Interconnect (OSI) model for connection to the remote computing system agent without being provisioned by a central processing unit (CPU).
 3. The non-transitory computer-readable medium of claim 1, wherein the SW-BSP further comprises an application module locally hosting the application kernel.
 4. The non-transitory computer-readable medium of claim 3, wherein the application module further loads the application kernel from a plurality of stored application kernels.
 5. The non-transitory computer-readable medium of claim 3, wherein the application module further provides a high-level synthesis (HLS) interface for the application kernel.
 6. The non-transitory computer-readable medium of claim 1, wherein the SW-BSP further comprises a board support module providing a physical layer connection to the network fabric.
 7. The non-transitory computer-readable medium of claim 6, wherein the board support module further implements memory, power, and kernel control for the SW-BSP.
 8. The non-transitory computer-readable medium of claim 6, wherein the board support module further provides partial reconfiguration logic to dynamically reconfigure the reconfigurable hardware accelerator during runtime.
 9. The non-transitory computer-readable medium of claim 6, wherein the board support module further receives and forwards SAP packets to the SAP module.
 10. A reconfigurable hardware accelerator comprising: a reconfigurable accelerator processor; and a memory comprising instructions which configure the reconfigurable accelerator processor with a software-defined board support package (SW-BSP); wherein the SW-BSP provides an interface for connection to a remote computing system agent over a network fabric.
 11. The reconfigurable hardware accelerator of claim 10, wherein the SW-BSP configures the reconfigurable hardware accelerator to connect to the remote computing system agent without being provisioned by a central processing unit (CPU).
 12. The reconfigurable hardware accelerator of claim 10, wherein the SW-BSP supports connection to one or more of a Peripheral Component Interconnect Express (PCIe), Transmission Control Protocol/Internet Protocol (TCP/IP), InfiniBand (IB), or Universal Serial Bus (USB) network fabric.
 13. The reconfigurable hardware accelerator of claim 10, wherein the memory further stores a plurality of application kernels, each defining a subroutine for executing a computational function.
 14. The reconfigurable hardware accelerator of claim 13, wherein the memory further stores instructions which, when executed by the reconfigurable accelerator processor, cause the reconfigurable hardware accelerator to: receive a request to execute a first computational function from the remote computing system agent; and select one of the plurality of application kernels suitable for executing the first computational function.
 15. A method for providing a software-defined board support package (SW-BSP) on a reconfigurable accelerator, the method comprising: initializing a stand-alone accelerator protocol (SAP) module for connecting to a remote computing system agent over a network fabric; receiving, over the network fabric, a request to execute a computational function; and passing the request to an application module.
 16. The method of claim 15, further comprising, at the application module, selecting an application kernel suitable for executing the computational function.
 17. The method of claim 16, further comprising reconfiguring the reconfigurable accelerator to host the selected application kernel.
 18. The method of claim 17, further comprising initializing a host control for the selected application kernel to monitor a status of the computational function.
 19. The method of claim 15, wherein on connection of the reconfigurable accelerator to the network fabric, the SAP module is initialized and registers with the remote computing system agent.
 20. The method of claim 15, further comprising causing the SAP module to provide an indication of available resources to the remote computing system agent. 